Assignment 4: Exploring Yelp Reviews

NOTE: THIS ASSIGNMENT IS OPTIONAL

You must complete one out of homeworks #4, #5, and #6.

Due date: Monday, 11/1 by the end of day

In this assignment, we'll explore restaurant review data available through the Yelp Dataset Challenge. The dataset includes Yelp user reviews and business information for 10 metropolitan areas. The data directory in this repository includes data files for reviews and restaurants in 3 of these cities: Cleveland, Pittsburgh, and Charlotte. These cities were chosen because their data files are not too large; the data for the other cities can be downloaded from the Yelp download page. For this assignment, you are welcome to analyze data from any of the three cities.

This assignment is broken into two parts:

Part 1: testing how well sentiment analysis works.

Because Yelp reviews include the number of stars given by the user, the Yelp data set provides a unique opportunity to test how well our sentiment analysis works by comparing the number of stars to the polarity of reviews.

Part 2: analyzing correlations between restaurant reviews and census data

We'll explore geographic trends in the restaurant reviews, comparing our sentiment analysis results with user stars geographically. We'll also overlay review stars on maps of household income (using census data).

Background readings

1. Does Sentiment Analysis Work?

In this part, we'll load the data, perform a sentiment analysis, and explore the results.

1.1 Load review data

You can choose data from Cleveland, Charlotte, or Pittsburgh. The data is stored as a JSON file and you can use pandas.read_json to load it.

Notes

The JSON data is in a "records" format. To load it, you'll need to pass the following keywords:

  • orient='records'
  • lines=True
In [1005]:
import pandas as pd
import geopandas as gpd
import nltk
import string
import textblob
import seaborn as sns
import re
import contextily as cx
from matplotlib import pyplot as plt
import numpy as np
import cenpy
import hvplot.pandas
import geoviews as gv
import geoviews.tile_sources as gvts
import holoviews as hv
from sklearn.linear_model import LinearRegression
from mpl_toolkits.axes_grid1 import make_axes_locatable
from holoviews import opts
from shapely.geometry import Polygon
nltk.download('stopwords')
hv.extension('bokeh', 'matplotlib');
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shaun\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [1006]:
review_clv = pd.read_json("./data/reviews_cleveland.json.gz",orient="records",lines=True)

1.2 Format the review text

The first step is to split the review text into its individual words and make all of the words lower-cased.

Add a new column, called 'formatted_text', in which each entry is a list of the lower-cased words in a review.

In [1007]:
review_clv["formatted_text"] = review_clv["text"].str.lower()
review_clv["formatted_text"] = review_clv["formatted_text"].str.findall( r'\w+|[^\s\w]+')

1.3 Review stop words

Use the nltk library to remove any stop words from the list of words in each review.

Overwrite the 'formatted_text' column to contain a list of lower-cased words in each review, with no stop words.

In [1008]:
stop_words = list(set(nltk.corpus.stopwords.words('english')))
punctuation = list(string.punctuation)
ignored = stop_words + punctuation
In [1009]:
review_clv["formatted_text"] = [[i for i in b if i not in ignored] for b in review_clv["formatted_text"]]
# drop any remaining tokens that start with a punctuation character
for p in punctuation:
    review_clv["formatted_text"] = [[i for i in b if not i.startswith(p)] for b in review_clv["formatted_text"]]

1.4 Calculate polarity and subjectivity

Using the formatted text column, create a list of textblob.TextBlob() objects and then extract the subjectivity and polarity.

Add two new columns to the review DataFrame: polarity and subjectivity.

Note: the polarity/subjectivity calculation can take several minutes to run

In [1010]:
blobs = [textblob.TextBlob(" ".join(b)) for b in review_clv["formatted_text"]]
In [1011]:
review_clv['polarity'] = [blob.sentiment.polarity for blob in blobs]
review_clv['subjectivity'] = [blob.sentiment.subjectivity for blob in blobs]

1.5 Comparing the sentiment analysis to number of stars

Use seaborn to make two box plots, one showing the polarity vs number of user stars and one showing the subjectivity vs the number of user stars.

In [1013]:
# instead of making two plots, a side-by-side plot is made to display the comparison
figure, axes = plt.subplots(1, 2, figsize=(16, 10))
figure.suptitle("Polarity & Subjectivity Plots based on Review Stars",fontsize=25)
flierprops = dict(marker='+', markerfacecolor='black', markersize=1, markeredgecolor='black')
sns.set_style("whitegrid")
sns.set(rc={'axes.facecolor':'#F0F0F1'})

boxplot1=sns.boxplot(
    y='polarity', 
    x='stars', 
    linewidth=1,
    data=review_clv, 
    color="#5687BA",
    ax=axes[0],
    width=0.6,
    flierprops=flierprops)

boxplot2=sns.boxplot(
    y='subjectivity', 
    x='stars', 
    linewidth=1,
    data=review_clv, 
    color="#E23446",
    ax=axes[1],
    width=0.6,
    flierprops=flierprops)

boxplot1.set_xlabel("Stars", fontsize=14)
boxplot1.set_ylabel("Polarity", fontsize=14)
boxplot2.set_xlabel("Stars", fontsize=14)
boxplot2.set_ylabel("Subjectivity", fontsize=14)
axes[0].axhline(y=0, c='k', lw=2,linestyle='--') 
axes[1].axhline(y=0, c='k', lw=2,linestyle='--') 

# lighten the box face colors
for patch in boxplot1.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .6))
for patch in boxplot2.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .6))

# draw the whiskers, caps, and medians in black
for line in boxplot1.get_lines():
    line.set_color('black')
for line in boxplot2.get_lines():
    line.set_color('black')


Question: What do your charts indicate for the effectiveness of our sentiment analysis?

Answer: Polarity rises steadily as the number of stars increases, while subjectivity rises only slightly. This suggests the sentiment analysis is not fully accurate: if it were, we would expect subjectivity to track polarity more closely, rising at a similar rate across star ratings.
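One way to quantify what the box plots suggest is to compute a rank correlation between polarity and stars directly. A minimal sketch on toy data, assuming the real notebook would use the `polarity` and `stars` columns of `review_clv`:

```python
import pandas as pd

# Toy stand-in for review_clv with the two columns of interest
df = pd.DataFrame({
    "stars": [1, 2, 3, 4, 5, 1, 5],
    "polarity": [-0.4, -0.1, 0.1, 0.3, 0.6, -0.2, 0.5],
})

# Spearman rank correlation: appropriate here because star ratings
# are ordinal, not evenly spaced sentiment values
rho = df["stars"].corr(df["polarity"], method="spearman")
print(round(rho, 3))  # close to 1 for this monotone toy data
```

A value near 1 would indicate that higher-polarity reviews reliably receive more stars.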

1.6 The importance of individual words

In this part, we'll explore the importance and frequency of individual words in Yelp reviews.

We will identify the most common words and then plot the average polarity vs the user stars for the reviews where those words occur.

1.6.1 Select a random sample of the review data

Select 1,000 random rows from the DataFrame holding the review data. Use the .sample() function to perform the selection.

In [1014]:
review_clv_1000 = review_clv.sample(n = 1000,random_state=1)

1.6.2 Re-format the data

Pass the subset of review data from the previous part to the reshape_data() function defined below. Explore the result of this function, and in one or two sentences, explain the operation performed by reshape_data().

In [1015]:
def reshape_data(review_subset):
    """
    Reshape the input dataframe of review data.
    """
    from pandas import Series, merge
    
    X = (review_subset['formatted_text']
         .apply(Series)
         .stack()
         .reset_index(level=1, drop=True)
         .to_frame('word'))
    
    
    R = review_subset[['polarity', 'stars', 'review_id']]
    
    return merge(R, X, left_index=True, right_index=True).reset_index(drop=True)
In [1017]:
review_clv_1000_2 = review_clv_1000
review_clv_1000_2_reshape = reshape_data(review_clv_1000_2)
In [1018]:
review_clv_1000_2_reshape
Out[1018]:
polarity stars review_id word
0 0.187662 3 YPclH3KxQEgB5nI_RseVbw dinner
1 0.187662 3 YPclH3KxQEgB5nI_RseVbw came
2 0.187662 3 YPclH3KxQEgB5nI_RseVbw occasion
3 0.187662 3 YPclH3KxQEgB5nI_RseVbw late
4 0.187662 3 YPclH3KxQEgB5nI_RseVbw night
... ... ... ... ...
58354 0.322500 4 NDSMnfUhrpajMkiDAUKNDw another
58355 0.322500 4 NDSMnfUhrpajMkiDAUKNDw time
58356 0.322500 4 NDSMnfUhrpajMkiDAUKNDw try
58357 0.322500 4 NDSMnfUhrpajMkiDAUKNDw additional
58358 0.322500 4 NDSMnfUhrpajMkiDAUKNDw dishes

58359 rows × 4 columns

Question: what is the operation performed by the reshape_data() function?

Overall: reshape_data() expands the review table so that there is one row for every individual word, with that review's polarity, stars, and review_id repeated on each row.

Step by step:

  1. Expand each review's list of words so that every word occupies its own column.
  2. Use stack() to pivot those word columns into a single long "word" column, keeping the original review index.
  3. Merge the word column back with the review-level columns (polarity, stars, review_id) on the shared index.
  4. Reset the index so the result is numbered sequentially.
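For reference, the same reshaping can be written more directly with pandas' built-in explode(). A sketch on toy data (the column names mirror the notebook's, but the values are made up):

```python
import pandas as pd

# Toy reviews: one row per review, with a list of words per row
reviews = pd.DataFrame({
    "review_id": ["a", "b"],
    "polarity": [0.5, -0.2],
    "formatted_text": [["good", "food"], ["slow", "service"]],
})

# explode() turns each list element into its own row, repeating
# the review-level columns alongside each word
long = (reviews
        .explode("formatted_text")
        .rename(columns={"formatted_text": "word"})
        .reset_index(drop=True))
print(long)  # one row per (review, word) pair: 4 rows total
```

This produces the same one-row-per-word shape as reshape_data(), with less intermediate machinery.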

1.6.3 Calculate the average number of stars and polarity for each word

Using the result from 1.6.2, group by the "word" column, and calculate the following three quantities:

  1. the size of each group
  2. the average number of user stars for each word
  3. the average polarity for each word

Combine these three results into a single DataFrame object.

Hint: you can combine the three results using either the pandas.concat() or the pandas.merge() function.

In [1019]:
review_clv_1000_2_reshape_text =review_clv_1000_2_reshape
review_clv_1000_2_reshape_text['count'] = 1
In [1023]:
size = review_clv_1000_2_reshape_text.groupby(["word"],as_index=False)['count'].sum()
avg_stars =  review_clv_1000_2_reshape_text.groupby(["word"],as_index=False)['stars'].mean()
avg_polarity =  review_clv_1000_2_reshape_text.groupby(["word"],as_index=False)['polarity'].mean()
summary = pd.merge(size, avg_stars, on="word")
summary = pd.merge(summary, avg_polarity, on="word")
summary
Out[1023]:
word count stars polarity
0 0 5 2.000000 0.058358
1 00 13 3.461538 0.179259
2 000 2 2.500000 0.227037
3 05 1 4.000000 0.336364
4 09 1 4.000000 0.212240
... ... ... ... ...
7802 很不错的一家店 1 5.000000 0.358333
7803 很有家乡的味道 1 5.000000 0.358333
7804 立地は良いと思った 1 2.000000 0.000000
7805 飲み物もオーダーしてから10分以上も待たせる 1 2.000000 0.000000
7806 2 5.000000 0.358333

7807 rows × 4 columns
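As an aside, the three separate groupbys above can be collapsed into a single pass with named aggregation. A sketch assuming a reshaped frame with `word`, `stars`, and `polarity` columns (toy values here):

```python
import pandas as pd

# Toy version of the one-row-per-word frame
long = pd.DataFrame({
    "word": ["good", "good", "slow"],
    "stars": [5, 4, 2],
    "polarity": [0.6, 0.4, -0.3],
})

# One groupby, three aggregations: group size, mean stars, mean polarity
summary = (long
           .groupby("word")
           .agg(count=("word", "size"),
                stars=("stars", "mean"),
                polarity=("polarity", "mean"))
           .reset_index())
print(summary)
```

This avoids the extra 'count' column and the two merges, at the cost of being slightly less explicit about each step.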

1.6.4 Select words that occur at least 50 times in reviews

Trim your DataFrame from the last section to only include words that occurred at least 50 times. Remember, when you grouped by the 'word' column, the size() function told you how many times each word occurred.

In [1024]:
filtered=summary.loc[(summary["count"]>=50)]

1.6.5 Plot the average polarity vs user stars

Use matplotlib to make a scatter plot of the average user stars vs average polarity for the words in the data frame from the last section. This will involve two steps:

Loop over each row of the data frame from the last section and for each row:

  1. Use plt.scatter(x, y) to plot a scatter plot, where x is polarity and y is stars.
  2. Use plt.text(x, y, word) to add the corresponding word to each scatter marker.

Using the data frame from section 1.4, add vertical and horizontal lines to your chart that show the average number of user stars and the average polarity across all reviews in the data set.

Make sure the figure is big enough so that you can make out some of the words, especially at low and high polarity values. You should be able to see a strong trend between polarity and user stars, and some of the most common words occurring in these reviews.

In [1031]:
mean_polarity=review_clv['polarity'].mean()
mean_stars=review_clv['stars'].mean()
print(mean_polarity,mean_stars)
0.25948875457930276 3.7636730794810096
In [1034]:
# Plot Scatter
fig, ax = plt.subplots(figsize=(90,55))
size_scatter=12*filtered['count']
plt.scatter('polarity', 'stars', s=size_scatter,
            data=filtered,color="red", alpha=0.5,
           edgecolors="white", linewidth=2)
plt.xlim(0.05, 0.42)
plt.ylim(2, 4.5)
plt.grid(color='#A8A8A8', lw=5,linestyle='dashed')

# Plot Text
for a in range(len(filtered)):
    ax.text(x=filtered.iloc[a,3]+0.001,y=filtered.iloc[a,2],s=filtered.iloc[a,0])

# Setting
ax.axvline(x=mean_polarity, c='k', lw=10,linestyle='dashed')
ax.axhline(y=mean_stars, c='k', lw=10,linestyle='dashed')
ax.set_xlabel("Polarity", fontsize=75)
ax.set_ylabel("Stars", fontsize=75);
ax.set_title("Polarity as a function of Reviews' stars", fontsize=100);
ax.tick_params(axis='both', which='major', labelsize=65)
ax.spines["left"].set_color("black")
ax.spines["bottom"].set_color("black")

# Regression Line
polarity = filtered['polarity'].values.reshape(-1, 1)
stars = filtered['stars'].values

# fit stars as a linear function of log(polarity)
linear_regressor = LinearRegression()
linear_regressor.fit(np.log(polarity), stars)
x_pred = np.log(np.linspace(0.05, 0.42, num=200).reshape(-1, 1))
y_pred = linear_regressor.predict(x_pred)
ax.plot(np.exp(x_pred), y_pred, color="#696969", lw=10)


plt.rcParams['axes.facecolor'] = 'white'

2. Correlating restaurant data and household income

In this part, we'll use the census API to download household income data and overlay restaurant locations.

2.1 Query the Census API

Use the cenpy package to download median household income in the past 12 months by census tract from the 2018 ACS 5-year data set for your county of interest.

You have two options to find the correct variable names:

At the end of this step, you should have a pandas DataFrame holding the income data for all census tracts within the county being analyzed.

Hints

The FIPS codes for the various state/counties are:

  • Pittsburgh
    • PA code: '42'
    • County code: '003' (Allegheny County)
  • Cleveland
    • OH code: '39'
    • County code: '035' (Cuyahoga County)
  • Charlotte
    • NC code: '37'
    • County code: '119' (Mecklenburg County)
In [1035]:
acs = cenpy.remote.APIConnection("ACSDT5Y2018")
In [1036]:
clv_MedHHInc_tract = acs.query(
    cols=["NAME", "B19013_001E"],
    geo_unit="tract:*",
    geo_filter={
                "state" : "39", 
                "county" : "035"
               },
).rename(columns={"B19013_001E": "MedHHInc"}, errors="raise")

2.2 Download census tracts from the Census and merge the data from Part 2.1

  • Use the cenpy package to set the correct map service and download census tracts for the desired geography
  • Merge the downloaded census tracts with the household income DataFrame, making sure to specify the proper columns to perform the merge on.
In [1037]:
acs.set_mapservice("tigerWMS_ACS2018")
where_clause = "STATE = 39 AND COUNTY = 035"
clv_tracts = acs.mapservice.layers[8].query(where=where_clause)
In [1038]:
clv_MMedHHInc_M = clv_tracts.merge(
    clv_MedHHInc_tract,
    left_on=["STATE", "COUNTY", "TRACT"],
    right_on=["state", "county", "tract"],
).loc[:, ['geometry', 'NAME_y', 'MedHHInc','state','county','tract']].to_crs(epsg=32617)

clv_MMedHHInc_M=clv_MMedHHInc_M.rename(columns={"NAME_y": "NAME"}, errors="raise")

clv_MMedHHInc_M['MedHHInc'] = clv_MMedHHInc_M['MedHHInc'].astype(float).round()

clv_MMedHHInc_M = clv_MMedHHInc_M[clv_MMedHHInc_M['MedHHInc']>0]

2.3 Plot a choropleth map of the household income

Use the built-in geopandas plot() function.

Be sure to convert to a reasonable CRS first!

In [1040]:
fig, ax = plt.subplots(figsize=(20,20))
clv_MMedHHInc_M.plot(
    ax=ax, 
    column='MedHHInc',
    legend=True,
    cmap='viridis',
    scheme='quantiles', 
    alpha=0.4, 
    edgecolor='k'
)

cx.add_basemap(ax,zoom=12, crs=clv_MMedHHInc_M.crs, source=cx.providers.OpenStreetMap.Mapnik)
ax.set_axis_off()
ax.set_title("Map of Median Household Income, Cleveland", fontsize=25);

2.4 Load the restaurants data

Use the latitude and longitude columns to create a GeoDataFrame after loading the JSON data.

Notes

The JSON data is in a "records" format. To load it, you'll need to pass the following keywords:

  • orient='records'
  • lines=True
In [1041]:
clv_res=pd.read_json("./data/restaurants_cleveland.json.gz",orient="records",lines=True)
In [1042]:
clv_res['geometry'] = gpd.points_from_xy(clv_res['longitude'], clv_res['latitude'])
clv_res = gpd.GeoDataFrame(clv_res, geometry='geometry', crs="EPSG:4326")
clv_res=clv_res.to_crs(epsg=32617)

2.5 Overlay restaurants on the income map

Overlay the restaurants and color the points according to the 'stars' column.

You can use the 'coolwarm' color map: blue points indicate below-average stars and red points above-average stars.

Hint

You can use the .geometry.total_bounds attribute to get the axes limits of the county's census tracts.

[xmin, ymin, xmax, ymax] = income.geometry.total_bounds

You can then use these limits to set the matplotlib plot limits accordingly.
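In matplotlib, the hint above translates to something like the following sketch, where `income` is a synthetic stand-in for the census-tract GeoDataFrame:

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Polygon

# Synthetic stand-in for the income GeoDataFrame of census tracts
income = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (2, 0), (2, 1), (0, 1)])],
    crs="EPSG:32617",
)

fig, ax = plt.subplots()
# total_bounds returns [xmin, ymin, xmax, ymax] over all geometries
xmin, ymin, xmax, ymax = income.geometry.total_bounds
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
```

Setting the limits from the tract bounds keeps the restaurant overlay clipped to the county rather than to the full extent of the point data.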

In [1045]:
Income = clv_MMedHHInc_M.hvplot(c='MedHHInc',
                      frame_width=800, 
                      frame_height=600,
                      crs=32617,
                      cmap='viridis',
                      alpha=0.7,
                      dynamic=False)

Restaurant = clv_res.hvplot(
                      frame_width=800, 
                      frame_height=600,  
                      crs=32617, 
                      c="stars",
                      hover_cols=['name','stars'],
                      alpha=0.9,
                      dynamic=False).options(cmap=["#FFEE00","#F55368"])

combination3 = gvts.Wikipedia * Income * Restaurant

combination3.opts(
    opts.WMTS(width=800, height=600, xaxis=None, yaxis=None),
    opts.Overlay(title="Map of Restaurant and Median Household Income"))
Out[1045]:

2.6 Comparing polarity vs. stars geographically

  • Merge the restaurants GeoDataFrame with the DataFrame with the 'polarity' column for each review.
  • Make a side-by-side plot with two columns: one subplot shows hex bins giving the polarity of the restaurant review and the other shows hex bins giving the number of stars

As we saw in Section 1, you should see strong correlation between the two subplots.

Hints

  • The 'business_id' column should be present in both the data frames holding review data and restaurant data.
  • See the plt.subplots() function for creating a figure with 2 subplots.
In [1046]:
review_clv_2 = review_clv.loc[:,["business_id","polarity"]]
mergedf = pd.merge(clv_res, review_clv_2, on="business_id")
In [1049]:
# Set canvas
fig, axs = plt.subplots(ncols=2,figsize=(20,10))
ax1=axs[0]
ax2=axs[1]

# hexbin coordinate
xcoords = mergedf.geometry.x
ycoords = mergedf.geometry.y
polarity = mergedf.polarity

# Tract plot
clv_MMedHHInc_M.plot(ax=ax1,facecolor="none", edgecolor="black", linewidth=0.25)
clv_MMedHHInc_M.plot(ax=ax2,facecolor="none", edgecolor="black", linewidth=0.25)

# Hexbin plot
hex_vals1 = ax1.hexbin(
    xcoords, 
    ycoords, 
    gridsize=30,
    C=mergedf.polarity,
    reduce_C_function=np.median)

hex_vals2 = ax2.hexbin(
    xcoords, 
    ycoords, 
    gridsize=30,
    C=mergedf.stars,
    reduce_C_function=np.median)

ax1.set_title("Map of Review Polarities, Cleveland", fontsize=18);
ax2.set_title("Map of Review Stars, Cleveland", fontsize=18);

ax1.set_axis_off()
ax2.set_axis_off()

# Color bar
divider1 = make_axes_locatable(ax1)
cax1 = divider1.append_axes('right', size='5%', pad=0.05)
colorbar_polarity=fig.colorbar(hex_vals1, cax=cax1, orientation='vertical')

divider2 = make_axes_locatable(ax2)
cax2 = divider2.append_axes('right', size='5%', pad=0.05)
colorbar_stars=fig.colorbar(hex_vals2, cax=cax2, orientation='vertical')

colorbar_polarity.ax.set_title('Polarity')
colorbar_stars.ax.set_title('Star')

# Basemap plot
cx.add_basemap(ax1,zoom=13, crs=clv_MMedHHInc_M.crs, source=cx.providers.OpenStreetMap.Mapnik)
cx.add_basemap(ax2,zoom=13, crs=clv_MMedHHInc_M.crs, source=cx.providers.OpenStreetMap.Mapnik)